test(qwp): fix flaky port-bind race in WebSocket client tests#47
Open
puzpuzpuz wants to merge 1 commit into
Open
test(qwp): fix flaky port-bind race in WebSocket client tests#47puzpuzpuz wants to merge 1 commit into
puzpuzpuz wants to merge 1 commit into
Conversation
TestPorts.findUnusedPort() opened an ephemeral ServerSocket, read its port, then closed it -- releasing the port. TestWebSocketServer.start() re-bound that port only later, so another process (or the very next findUnusedPort() call) could grab it in the gap, surfacing as a flaky "BindException: Address already in use" on loaded CI runners. TestWebSocketServer now binds its loopback listener eagerly in the constructor, holds it for the server's whole lifetime, and exposes the OS-assigned port via getPort(); start() just launches the accept loop on the already-bound socket. Owning the port from allocation to teardown closes the race window entirely. Every test that starts a server now reads server.getPort() instead of pre-allocating a port via findUnusedPort(); findUnusedPort() stays only where a test points a client at a deliberately-dead endpoint. The two raw ServerSocket fixtures that shared the same race (Always401Fixture, Auth401AfterFirstConnectionFixture) adopt the same eager-bind pattern; the other raw fixtures already used it. The change also removes two latent same-class hazards the migration surfaced: the sequential fixed-port allocPort() in DurableAckIntegrationTest and the port1 + 50 guess in CleanShutdownNoReplayTest. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
The WebSocket client test suite intermittently failed with
java.net.BindException: Address already in use, e.g.:The cause is a time-of-check-to-time-of-use port race.
TestPorts.findUnusedPort()opened an ephemeralServerSocket, read its port, then closed it — releasing the port.TestWebSocketServer.start()re-bound that port only later, so in the gap another process (or the very nextfindUnusedPort()call, since nothing held the port) could take it. The failure is timing- and load-dependent, so it shows up as a rare flake on busy CI runners rather than deterministically.Fix
TestWebSocketServernow binds its loopback listener eagerly in the constructor, holds it for the server's whole lifetime, and exposes the OS-assigned port viagetPort().start()just launches the accept loop on the already-bound socket. Owning the port from allocation to teardown closes the race window entirely.Call sites no longer pre-allocate a port: each test reads
server.getPort()after constructing the server.findUnusedPort()remains only where a test points a client at a deliberately-dead endpoint (no server bound). The two rawServerSocketfixtures that carried the same race (Always401Fixture,Auth401AfterFirstConnectionFixture) adopt the same eager-bind pattern; the other raw fixtures already used it. The change also removes two latent same-class hazards the migration surfaced: the sequential fixed-portallocPort()inDurableAckIntegrationTestand theport1 + 50guess inCleanShutdownNoReplayTest.Tradeoffs
Eager binding means the listener is accept-able at the TCP level from construction rather than from
start(). For "server arrives late" tests this changes the cause of a pre-start()connection failure (upgrade timeout instead ofconnection-refused) but not the outcome: there is no accept loop until
start(), so the client still fails and retries, and the assertions are about end state. Those tests pass unchanged. A minor side effect is some benignHandshake failedserver logs when a stale pre-start()connection is drained after the accept loop starts.The diff is broad (19 files) because the helper is widely used, but each call-site change is mechanical: drop the pre-allocated port argument, read
getPort()instead.Test plan
coretest suite on the rebased branch: 2314 tests, 0 failures, 0 errors, 1 skippedQwpQueryClientWalkTrackerTestpassesInitialConnectAsyncTest,InitialConnectRetryTest,CloseDrainTest,RecoveryReplayTest,ReconnectTest) pass, including the "server arrives late" scenariosnew TestWebSocketServer(port, ...)calls and no rawServerSocketbound to a pre-selected port